Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
نویسندگان
چکیده
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools and, above all, supporting computational grammars appear no longer as a matter of convenience but of necessity. In this paper, we report on the design features, the development conditions and the methodological options of a deep linguistic databank, the CINTIL DeepGramBank. In this corpus, sentences are annotated with fully fledged linguistically informed grammatical representations that are produced by a deep linguistic processing grammar, thus consistently integrating morphological, syntactic and semantic information. We also report on how such corpus permits to straightforwardly obtain a whole range of past generation annotated corpora (POS, NER and morphology), current generation treebanks (constituency treebanks, dependency banks, propbanks) and next generation databanks (logical form banks) simply by means of a very residual selection/extraction effort to get the appropriate “views” exposing the relevant layers of information.
منابع مشابه
A PropBank for Portuguese: the CINTIL-PropBank
With the CINTIL-International Corpus of Portuguese, an ongoing corpus annotated with fully flegded grammatical representation, sentences get not only a high level of lexical, morphological and syntactic annotation but also a semantic analysis that prepares the data to a manual specification step and thus opens the way for a number of tools and resources for which there is a great research focus...
متن کاملParsing Paraphrases with Joint Inference
Treebanks are key resources for developing accurate statistical parsers. However, building treebanks is expensive and timeconsuming for humans. For domains requiring deep subject matter expertise such as law and medicine, treebanking is even more difficult. To reduce annotation costs for these domains, we develop methods to improve cross-domain parsing inference using paraphrases. Paraphrases a...
متن کاملIdentifying complex phenomena in a corpus via a treebank lens
While syntactically annotated corpora known as treebanks have been available for many years, along with a variety of customized tools for querying these annotations, the mapping from actual annotations to relevant syntactic or semantic phenomena has been obscured by the coarse-grained labelling of nodes in the parse trees which make up the treebanks. This lack of linguistic detail has hampered ...
متن کاملCINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese
This paper presents a new linguistic resource for the study and computational processing of Portuguese. CINTIL DependencyBank PREMIUM is a corpus of Portuguese news text, accurately manually annotated with a wide range of linguistic information (morpho-syntax, named-entities, syntactic function and semantic roles), making it an invaluable resource specially for the development and evaluation of...
متن کاملEighth International Conference on Language Resources and Evaluation
Evaluating a taxonomy learned automatically against an existing gold standard is a very complex problem, because differences stem from the number, label, depth and ordering of the taxonomy nodes. In this paper we propose casting the problem as one of comparing two hierarchical clusters. To this end we defined a variation of the Fowlkes and Mallows measure (Fowlkes and Mallows, 1983). Our method...
متن کامل